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Abstract — The Baum-Welsh algorithm together with its deriva- 
tives and variations has been the main technique for learning 
Hidden Markov Models (HMM) from observational data. We 
present an HMM learning algorithm based on the non-negative 
matrix factorization (NMF) of higher order Markovian statistics 
that is structurally different from the Baum-Welsh and its asso- 
ciated approaches. The described algorithm supports estimation 
of the number of recurrent states of an HMM and iterates the 
non-negative matrix factorization (NMF) algorithm to improve 
the learned HMM parameters. Numerical examples are provided 
as well. 

Index Terms — Hidden Markov Models, machine learning, non- 
negative matrix factorization. 



I. Introduction 

Hidden Markov Models (HMM) have been successfully 
used to model stochastic systems arising in a variety of appli- 
cations ranging from biology to engineering to finance [1 J, |2J, 
El, El, 111, 161. Following accepted notation for representing 
the parameters and structure of HMM's (see fT\, fSl, f9l, fT], 
UO] for example), we will use the following terminology and 
definitions: 

1) is the number of states of the Markov chain underly- 
ing the HMM. The state space is 5 — {S*!, Sn} and 
the system's state process at time t is denoted by xu 

2) M is the number of distinct observables or symbols 
generated by the HMM. The set of possible observables 
is V = {wi, vm} and the observation process at time 
t is denoted by yt- We denote by y'^ the subprocess 

2/ti2/ti+i ■ • -yta; 

3) The joint probabilities 

ay(fc) = P{xt+i = Sj,yt+i = Vk\xt = Si); 

are the time-invariant probabilities of transitioning to 
state Sj at time t+1 and emitting observation Vk given 
that at time t the system was in state Si. Observation 
Vk is emitted during the transition from state Si to state 
Sj. We use A{k) = {aijik)) to denote the matrix of 
state transition probabilities that emit the same symbol 
Vk. Note that A = X^fc ^(^) the stochastic matrix 
representing the Markov chain state process xt. 
The initial state distribution, at time t = 1, is given by 
r = {7i,...,7jv} where 7^ = P{xi = 5i) > and 
7» = 1- 
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Collectively, matrices A{k) and F completely define the HMM 
and we say that a model for the HMM is A = [{A{k) \ 1 < 
k < M},T). 

We present an algorithm for learning an HMM from single 
or multiple observation sequences. The traditional approach 
for learning an HMM is the Baum-Welsh Algorithm (|T| which 
has been extended in a variety of ways by others [TTl, fT2l, 
113|. More recently, a different type of HMM approximation 
algorithm has been proposed by Finesso et al. |[T4l . based 
on stochastic realization techniques. Our approach is related 
to the Finesso approach which is significantly different from 
Baum-Welsh based methods. In particular, our algorithm is in 
the spirit of the Myhill-Nerode (T5\ construction for building 
automata from languages. In that construction, states are 
defined as equivalence classes of pasts which produce the 
same future outputs. In an HMM, the "future" of a state is 
a distribution over future observations. 

At a conceptual level, our algorithm operates as follows. 
We first estimate the matrix of an observation sequence's high 
order statistics. This matrix has a natural non-negative matrix 
factorization (NMF) [[T6l which can be interpreted in terms 
of the probability distribution of future observations given the 
current state of the underlying Markov Chain. Once estimated, 
these probability distributions can be used to directly estimate 
the transition probabilities of the HMM. 

The estimated HMM parameters can be used, in turn, to 
compute the NMF matrix factors as well as the underlying 
higher order correlation matrix from data generated by the 
estimated HMM. We present a simple example in which an 
NMF factorization is exact but does not correspond to any 
HMM. This is a fact that can be established by comparing the 
factors computed by the NMF with the factors computed by 
the estimated HMM parameters. This kind of comparison is 
not possible with other approaches (T^. 

It is important to point out that the optimal non-negative 
matrix factorization of a positive matrix is known to be NP- 
Hard in the general case |17|, so in practice one computes 
only locally optimal factorizations. As we will show through 
examples, the repeated iteration of the factorization and tran- 
sition probability estimation steps improves the factorizations 
and overall model estimation. Details are provided below. 

A. Preliminaries and Notation 

The only input to our algorithm is an observation sequence 
of length T of the HMM, namely: 

Oi-.T = Oi02...0t 
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where Of G V is the HMM output at observation time t. 

We do not assume that the observation time t = 1 coincides 
with the process' initial state so that the initial distribution of 
states is not necessarily governed by F. In fact, at present, 
our algorithm is capable of learning only the ergodic partition 
of an HMM, namely the set of states that are recurrent. 
Consequently, our model of an HMM refers only to the 
transition probability component A = {A{k)}k that identifies 
this ergodic partition (see I.18i . |fT9l for some background on 
this concept). 

Given Oi;t, we construct two summary statistics repre- 
sented as matrices i?*'-" and F^^'^ for positive integers p and 
s. RP'" is simply a histogram of contiguous prefix-suffix 
combinations whose rows are indexed by observations subse- 
quences of length p and columns are indexed by observation 
subsequences of length s. 

If there are M symbols in the observation alphabet, then 
RP''' is an by A/** matrix whose (i,j)th entry is the 
number of times the prefix substring corresponding to i is 
immediately followed by the substring corresponding to j. The 
correspondence between strings and integers is lexicographic 
in our examples below although any other correspondence will 
do as well. 

The matrix F^^'^ is simply RP'" normalized to be row 
stochastic. Specifically, if G = (gi) where gi — J^j ^i 'j 
then = RP;^/g, for g, ^ and F^^ = for g, = 0. 
Rows of RP''', and correspondingly F^'*, are zero if the prefix 
corresponding to the row label is not observed in the data. 
Zero rows of these matrices can be deleted reducing the size 
of the matrices without affecting the algorithm describe below. 
Accordingly, F^'" is constructed to be row stochastic. 

Entry F^'^ is essentially an estimate of P{V\U) the prob- 
ability of observing observation sequence V of length s, 
indexed by v, following observation sequence U of length p, 
indexed by u. 

Note that while R, F and G have exponentially many rows 
and columns with respect to p and s, the actual number of 
nonzero entries in these matrices are bounded above by T so 
that, stored as sparse matrices, they require no more storage 
than the original observation sequence. Note that Baum- Welsh 
methods require storing and repeatedly accessing the original 
observation sequence. 

A simple but key observation about states of an HMM is that 
each state of an HMM induces a probability distribution on 
symbol subsequences of any length s. Specifically, suppose 
an HMM, A, is in state Si„ (having not yet emitted an 
observation in that state) and consider the symbol subsequence 
V = Vj-^ ■ . -Vj^ . Then 



P{V\S.„,s,\)^P{yl+l^v,,v 



is independent of t under the ergodic assumption and can be 
computed from the A(fc)'s according to 



P{V\S.„s,X) = e's.^l[A{jr)l , 



(1) 



r=l 



where denotes the (0, l)-vector whose only nonzero entry 
is in position i and 1 = [1 1 ... 1]'. Call this probability 



distribution on substrings of length s, P(-|S'i, s, A). It is known 
that the distributions P{-\Si,s,X) for p + s > 2A^ - 1 are 
complete characterizations of the ergodic states of the HMM 
with respect to the observables of the HMM ||20l, lfT4l . 

We now focus attention on substrings that precede state 
occupancy in the HMM's underlying Markov chain. Over the 
course of a long observation sequence such as Oi-t, there 
is some probability, P{Si\U,p, X) that the HMM is in state 
i given that we have just observed the length p substring 
U — Vj^ Vj^ ■ ■ - Vj ■ These probabilities can be computed from 
the A(fc)'s according to 



P(^,JC/,P,A) 



(2) 



PiU\p,X) 

where tt is the stationary distribution of the underlying Markov 
chain process and P{U\p, A) — tt' Ylr=i ^(ji-)l- 

Note that formulae (1) and (2) are closely related to com- 
putations arising in the classical Viterbi algorithm [JJ. 

Let U, V be two strings of observations of length p and s 
respectively. Let U and V be identified with integers u and v as 
already explained before so that P{V\U,X) — PP'^,. Assume 
V was emitted after time t and U immediately preceded V. 
We call U the prefix string and V the suffix string. Then by 
applying elementary properties of probability we can write: 



pp,s 



P{yl+l^V\ 2/,%+i = C/,A) 



N 

^P{ylXl = V,xt^S,\yl_^^, 

k=l 
N 

E 

fe=i 



U,X) 



PiV\Sk,s,X)PiSk\U,p,X) 



Consequently we can express the distribution F^.^ 
P{-\U, X) as a mixture 



' 11 ■ 



N 

E 

fe=i 



P{-\Sk,s,X)P{Sk\U,p,X) 



(3) 



If the underlying state process xt is ergodic then in the limit 
as T — > oo relation (O becomes an equality almost surely. As 
a result of the above observations, for sufficiently large p and 
s, the matrix PP'" has the following properties: 

• rank(F^''*) < A^, where N is the minimal number of 
states representing the HMM, A; 

• Each row of PP'^ is a convex combination (mixture) of 
the N generators, P{-\Si, s, A), for i = 1, 2, . . . , N.; 

• Letting D be the N x M" nonnegative matrix whose 
rows are the distributions P{-\Sk,s,X), i.e., Dk.-. = 
P{-\Sk,s,X), for k ^ 1,2,..., TV, we can rewrite © 
as 

PS:- -[PiSi\U,p,X)P{S2\U,p,X) ■■■ PiSN\U,p,X)]*D. 

Consequently, if we let C — {cu,k) be the MP x N 
nonnegative matrix with Cu,k — P{Sk\U,p, X) we can 
write PP'" C * D. Observe that C and D are both 
(row) stochastic. 

• The factorization depends on the model A. Moreover 
factors C and D can be computed directly from A using 
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([T]) and (|2]). Consequently, the size of the smallest model 
compatible with the data is equal to prank(FP^'^), the 
positive rank of F^'". (The positive rank, prank(A), of 
an m X n nonnegative matrix A is the smallest integer 
N such that A factors in the product of two nonnegative 
matrices of dimensions ray. N and N x n respectively.) 
It is known that rank(yl) < prank(A) < min{TO, n} and 
that the computation of prank(A) is NP-hard fVl]. So it 
would appear that in general it is NP-hard to estimate 
N given F^'* even in ideal conditions (T oo) since 
rank(F^' < N. However, it is not obvious how difficult 
it is to estimate when prank(FP''*) < rank(F^''^) in 
the case F^''^ was built from a typical realization of an 
HMM. We discuss an example at the end of this paper 
that illustrates the open problems and challenges. 
To summarize this discussion, note that the matrix F^''^ is 
based on the distribution of length p prefixes and corre- 
sponding length s suffixes and completely characterizes an 
HMM providing p + s > 2N — 1. Its positive rank is equal 
to the minimal number of states in the underlying Markov 
chain. Moreover, an appropriately constructed factorization of 
FP''^ exposes the state transition and emission probabilities of 
the HMM. The algorithm presented below extracts the state 
transition matrices, {A{k)}k from this factorization. In turn, 
as shown above, the A(fc)'s can be used to construct the 
probability distributions over suffixes that generate F^-* and 
so can be used to compute a new factorization. This iteration 
is essentially the basis for our algorithm. 

In the machine learning context, we have access only to a 
finite amount of observation data (T bounded). Consequently 
rank(F^'''*) will be generally higher than N. This requires a 
decision about the HMM's order, N, not unUke that arising 
in principal component analysis (PCA) II2TI to estimate the 
number of components. 

II. The Algorithm 

Based on the above discussion, our algorithm is outlined 
below. Numerical examples with discussions follow the formal 
description. 

1) Compute FP-'^ and G from the input observation data, 
Oi;T, defined above. 

2) Estimate the number of states, N, by analyzing the either 
FP^" or diag(G') H^FP '*, both computed in Step 1. In the 
cases in which prank(FP'*) ~ rank(FP'*) (e.g. when 
rank(F) < 2) one typical way to obtain this estimate is 
to compute the SVD (singular value decomposition) of 
the aforementioned matrices and then observe the rate of 
decrease of the singular values. For T sufficiently large 
a significant gap between the A^'^ and the {N + 1)*"^ 
largest singular value becomes appreciable. Note that 
since prank > rank, an estimate based on the singular 
values is a lower bound for the order of the HMM. 

3) Estimate distributions P{-\Si, s, A), for i — 1,2, . . . , N. 
This step is achieved through the Nonnegative Matrix 
Factorization (NMF) of FP'". This yields FP'" KiC^D 
with Di . « P{-\Si, s, A) as observed before. 

Note that because of the finiteness of T in general 
prank(FP^^) > N. So it is necessary to solve the 



approximate NMF which consists of determining C and 
D of dimensions MP x N and N x respectively that 
minimize Z)/d(F^''*||C * D), where 

Did{K\\W) - Y.^K,i ^ - + ^^^■) 

is the I-divergence function Q (observe that if 
VKl = I'Wl = 1 then DiDiK\\W) = 
Si j -^ij loS-^i,i/^«j SO '^he I-divergence function is a 
generalization of the Kullback-Leibler distance between 
probability distributions). This optimization problem can 
be solved through iterative methods [16j , [22 1 that re- 
quire initial matrices Co, Do and can only be guaranteed 
to converge to local optima. After executing this step, we 
have a locally optimal estimate of the true distributions 
P{-\S,,s,X). 

4) Estimate matrices A{k), k — 1,2, . . . , N, from D. Let 
us consider A{1) = (a^.j (1)), the other matrices are esti- 
mated in a similar manner. Let — Vj^Vj^ ■ ■ ■ Vj^_^ 
be a generic sequence of s — 1 observations. Then by 
marginalization we can write 

M 

F(F(^"1)|5„ s - 1, A) = ^ P{V^'-^^Vk\S„ s. A) . 
fc=i 

Consequently, the conditional distributions over suffixes 
of length s — 1, P{-\Si,s — 1,A), can be estimated 
from D by adding columns of D appropriately. Let 
H be the matrix thus obtained from D so that Hi -, « 
P{-\Si, s — 1, A). Those conditional distributions satisfy 
the following equality for any l/'""!); 

N 

P{vM'~^'^ \S„ s,X)^Y. I^J' A). 

Therefore P(wi • |5,;,.s,A) = Y.'^=^a,^ji^)P{-\S,,s ~ 
1,A) so we can obtain the unknown values ai,j{l) by 
solving the following systems of linear equations: 

A,i:Af=-i = i^l,2,...,N 

where A^^.,{1) = [a.j,i(l) aj,2(l) • • •aj,Ar(l)]. Compactly 
-D:,i:Af=-i = ^(1) * H. As in step 2, because of the 
finiteness of T and working with bounded arithmetic 
precision we need to content ourselves with a solution 
that minimizes some distance (for example, the Li 
norm) between -Di.i:Ms-i and Ai,-{1) * H, for all i. We 
have formulated these problems as linear programming 
problems using the Li norm. 

5) Output estimated HMM A' = {A{k) \ k = 1,2, . . . , N}. 
This algorithm can be iterated using the estimated A' and 
formula ([T]) to compute new matrices Cq and D'q, and then 
restarting from step 3 above with matrices Cq and Dq as initial 
factors in the approximate NMF. In particular: 

6) Compute D'q — [P{j\Si, s, X')]ij using formula ([T])- 

7) Compute Cg by solving the linear programming problem 
PP" ^ Cq* D'q, for a row stochastic Cq. 

8) Set Co := Cq and Dq D'q. 

9) goto 3). 
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Another possibility for step 7) above is to compute Cq 
using formulae (2) and (3) and then use the resulting Cq and 
D'q as initial guesses for the NMF algorithm. We have tried 
this variant but it does not produce significantly different final 
results. 

III. Numerical examples 

We call an HMM "Deterministic" (DHMM) if for each state 
there exists at most one outgoing transition labeled with the 
same observable. We demonstrate our method on a DHMM, on 
an HMM that can be transformed into an equivalent DHMM 
and also on an HMM for which such a transformation does 
not exist. We finally discuss an example that illustrates the 
situation when rank < prank. 

It is important to note that the significant metric for learning 
an HMM is not the extent to which the transition probabilities 
are accurately learned but the extent to which the observation 
statistics are learned. This is a consequence of the fact that 
HMM's with different transition probabilities and different 
numbers of states can produce observations sequences with the 
same statistics so that learning a specific transition probability 
characterization is not a well-posed problem. 

In our examples we measure the accuracy of our estimates 
by computing the I-divergence rate of the finite dimensional 
distributions associated with the observation process of the 
original model from those associated with the observation 
process of the estimated model. Formally, each HMM A 
induces a family of finite dimensional distributions 



N 
i=l 



on sequences of observations of length n, where tt is the 
stationary distribution of the underlying state process. Let A 
and A' be two HMM's with P„ and Q„ their respective induced 
finite dimensional distributions. The I-divergence rate of A 
from A' is defined as 

Did = lim -DiD{Pn\\Qn) 

n^QO TL 

when the limit exists. 

A. A DHMM Example 

Consider the stochastic process described by model Ai = 

({A(o),v4(i)},r = [o i])with 



A(0) 



0.5 





and A{1) 



0.5 





This is sometimes referred to as the "Even Process" 11231 . 112411 . 
We simulated this process and produced a sequence of T = 
1000 observations. Then we ran our algorithm with p — 2 and 

s = 3: 

1) Build F2 3 from data O: 



F = 



0.14 0.13 



0.13 0.14 

0.08 0.07 



0.26 


0.23 
0.17 




0.25 


0.08 




0.24 


0.08 



2) Estimate N = prank(F). Analyze singular values of F: 

[ 0.88 0.48 0.033 0.011 ] . 

This suggests rank(i^) = prank(i^) = 2. 

3) Estimate distributions 3, Ai) and P(-|S'2, 3, Ai) 
by solving argmincxi Did{F\\C * D): 



C 



0.02 0.98 

1 

1 

0.34 0.66 



D 



0.0 0.25 0.24 0.0 0.5 
0.13 0.13 0.25 0.25 0.24 



4) Estimate matrices A{0) and A{1): 



i(0) = 



2.2e 
6.9e 



18 
18 0.51 



0.0077 0.99 
0.49 



After a second iteration of the algorithm the reconstructed 
matrices become: 



i(0) 




5.6e-17 0.51 



Ail) 



0.0077 0.99 
0.49 



The reconstructed model is essentially identical to the original 
one except for state reordering. This result is particularly strik- 
ing when compared with existing techniques. For example, 
Crutchfield et al 1,23,1 demonstrated their Causal-State Splitting 
Reconstruction (CSSR) e-machine reconstruction algorithm on 
the same Even Process. Their result is essentially equivalent 
to ours in accuracy but it required the processing of 10"* 
observations instead of the 10'^ that we use. 

B. An HMM that has an equivalent DHMM 

Consider the model A2 = ({^(0), A(l)}, T = [0 1]) with 



A{Q) 



0.67 0.33 




and A{1) 



We simulated this process and produced a sequence of T = 
10000 observations. Then we ran our algorithm with p = 2 
and s = 3: 

1) Build F2 '3 from data O: 



F = 



0.3 0.15 

0.45 0.22 

0.29 0.16 





0.22 
0.33 
0.22 




0.22 0.11 








0.23 




0.1 





2) Estimate TV. Analyze singular values of F: 

[ 0.86 0.24 0.012 ] 

to estimate N . This suggests again N — 2. 

3) Estimate distributions P(-|S'i, 3, A2) and P(-|S'2, 3, A2). 
Solve argminc^D Did{F\\C * D): 



0.22 0.26 

0.5 

0.26 0.24 

0.17 0.33 



C 



0.6 0.4 

1 

0.59 0.41 
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D = 



0.45 0.22 0.33 
0.07 0.05 0.06 



0.0 
0.55 0.27 








4) Reconstruct matrices A{0) and A{1): 



i(0) 



0.6 0.4 
0.1 0.072 



3.2e-21 
0.83 



After a second iteration of the algorithm the reconstructed 
model becomes A2: 



^(0) 



0.6 
0.1 



0.4 
0.072 



4.2e-21 
0.83 



These computed transition probabilities are different enough 
from the transition probabilities of the original HMM used 
to generate the data but the statistics of the observation 
sequences are very close. Figure [1] shows the accuracy of this 
estimate in terms of the 1-divergence rate of the original model 
from the estimated one. We computed Dj£){Pn\\Qn)/n for 
n = 1, 2, . . . , 15, with P„ (Qn) being the finite dimensional 
probability distributions over sequences of observations of 
length n, emitted by model A2 (A2) in stationary conditions. 
We can observe that this quantity, the divergence rate of 
Pn from Qn, stabilizes to a very small value (smaller than 
2.5 • 10"'') as expected. 

In fact, this example is equivalent to a DHMM model as 
the reader can readily check independently. 

C. An HMM that has no equivalent finite state DHMM 
Consider the model A3 = ({A(0), T = [1 0]) with 



A(0) 



0.5 



0.5 



0.5 
0.5 
0.5 



and A{1) 




0.5 




We simulated this process and produced a sequence of T = 
10000 observations. Then we ran our algorithm with p — A 
and s = 5. After a second iteration of the algorithm the 
reconstructed model becomes A3: 



i(0) 



0.29 0.43 0.28 
0.29 0.18 
0.29 0.71 



0.0 
0.44 0.081 
0.0 



As before. Figure [T| (bottom) shows the accuracy of this 
estimate in terms of the I-divergence rate of the original model 
from the estimated one. 

Observe that this HMM cannot be transformed into an 
equivalent deterministic HMM [25 1. 

D. Discussion of Rank vs Prank 

We first provide an example of a stochastic matrix whose 
prank differs from its rank but that matrix does not represent 
the statistics of any HMM. 



^^16 



110 
10 10 
10 1 
11 



[11111111] 



Divergence rate of the original HIVIM from the estimated HMIVl 



5 10 
Length of observation sequences 

Divergence rate of the original HIVIM from the estimatect HMIVl 




5 10 
Length of observation sequences 



Fig. 1. Accuracy of HMM's A2 (top) and A3 (bottom). Here Pn is tlie 
distribution over sequences of observations of fixed length n induced by 
the original model whereas Qn refers to the estimated model. The sequence 
Dj£){Pn\\Qn)/ri is calculated for increasing values of n. 



where ® is the Kronecker product. We can verify that 
rank(F) — 3 whereas prank(i^) = 4 ll22l . Moreover 
F ^ CD exactly with 



and 



C = 



0.5 





0.5 





0.5 


0.5 














0.5 


0.5 





0.5 





0.5 



D = - 



1 

















1 








1 

















1 



[11111111] 



Assume that F was obtained from a typical sequence of 
observations emitted by an HMM A with 4 states so that 
F^'^ = F = CD. Then it must be that C = [P{Sj\i, 2, A)],j 
and D = [P(A:|S'j, 5, A)]j_fe. Consider the following model 
A = {A(1),A(2)} with 



A{Q) = 



0.5 0.5 


0.5 0.5 






0.5 0.5 



0.5 0.5 



One can verify that A is the only four-state model such that 
D = [P(A:|S'^-, 5, A)]j fc. In fact observe that the system of 
equations defining A in stage 4 of the algorithm admits, in 
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this case, only one solution. Nevertheless, using formula (|2]): 



[P(^,K,2,A)k, = (1/4)* 



" 1 


1 ■ 




■ 1 


1 " 


1 


1 




1 


1 



Consequently no HMM can generate F. 

1) An example of prank > rank for an exact HMM model: 
The following four-state model A is an example of an HMM 
whose induced F"^-^ matrix has rank 3 but positive rank 

prank — 4: 



0.5 
















































, ^(1) 






0.5 














1 














1 


1 












To verify the claim we computed factors C — [P{Sj\i, 
and D = [P(j|5'i, 5, A)]ij, for = 4, using formulae (Q} 
and dU and then obtained F"^'^ = C * D. Then we verified 
numerically that rank(P^'^) — 3. Finally, we applied Lemma 
2.4 in ||22l to confirm that prank(F2'^) = 4. We also 
verified the character of this model by directly applying our 
algorithm to it in order to obtain F'^'^ empirically (for T = 
10000). An analysis of the singular values of F'^'^, namely 
[0.8530 0.4825 0.1799 0.0114], demonsti-ates the difficulty 
of this case. The fourth singular value is nonzero due to 
the finiteness of T. Consequently it is difficult to determine 
whether iV = 3 or iV = 4. 

IV. Open Questions and Future Work 

A crucial issue is the estimation of N, the size of the 
smallest HMM that generates the stream of data. Under ideal 
conditions, (T — > c»), we have seen that N = prank(F^'*). 
However, filtering out "noise" from the empirical matrix F^''^ 
in order to have an accurate estimate of the positive rank is an 
open challenge. Observed that a spectral analysis of F may, 
in general, produce only a lower bound to N. 

A second important issue in our methodology concerns the 
computation of the approximate NMF. Existing methods are 
suboptimal due to the presence of local optima. This problem 
affects the accuracy of the produced estimate at each iteration 
of our algorithm. Consequently it is important to investigate 
convergence properties when stages 3 — 5 of the algorithm are 
iterated with new initial factors Cq , Dq to seed the approximate 
NMF, using Cq and Dq as computed according to steps 6 — 8, 
from model A' that was estimated in the preceding step. 

A third question concerns with properties of F^'* as s 
oo. In other words, can the Asymptotic Equipartition Property 
be applied to distributions P{-\Si, s, A) so that the distribution 
on the "typical" finite suffixes is uniform and the rest of the 
distribution is zero? 

V. Conclusion 

We have presented a new algorithm for learning an HMM 
from observations of the HMM's output. The algorithm is 
structurally different from traditional Baum- Welsh based ap- 
proaches 1 1 1, 1 1 ll, fT2l, fl3 1. It is related to but different from 
recent approaches in stochastic systems realization lfT4l . We 
believe this method opens a new line of algorithm development 



for learning HMM's and has the advantage of a estimating 
the HMM order from spectral properties of the high order 
correlation statistics of the observation sequence. The algo- 
rithm effectively compresses data by summarizing it into a 
statistical matrix. Options for recursively computing the steps 
of the algorithm to achieve on-line algorithms will be explored. 
Additionally, sparse matrix algorithms can be explored for 
space and time efficiency when the underlying matrices are 
large and sparse. 
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